Abstract

Linear regression models depend directly on the design matrix and its properties.
Techniques that efficiently estimate model coefficients by partitioning rows of
the design matrix are increasingly popular for large-scale problems because they fit
well with modern parallel computing architectures. We propose a simple measure
of concordance between a design matrix and a subset of its rows that estimates
how well a subset captures the variance-covariance structure of a larger data set.
We illustrate the use of this measure in a heuristic method for selecting row partition
sizes that balance statistical and computational efficiency goals in real-world
problems.


1 Introduction 


A common procedure in supervised learning problems when the number of rows of
data is large and far exceeds the number of columns is to partition the rows, fit models
on individual partitions, and combine them by averaging or other aggregation into a
single model. This computational approach is referred to as “Divide and Recombine”
(D&R) in Guha et al. (2012) and has gained wide use in part because it can be described
as a single MapReduce (Dean and Ghemawat, 2008) step and is easily implemented
in software frameworks like Hadoop (Apache Software Foundation, 2014) and Spark
(Zaharia et al., 2010).

Partitions are constructed either by conditioning-variable division or replicate
division. The former adds samples to a partition based on one or more of the variables
in the data. For categorical variables this generally means one partition per level and is
equivalent to a variable interaction between the conditioning categorical variable and
all other regressors in the model. For continuous variables this means one partition
per range of values. Alternatively, data can be partitioned along multiple conditioning
variables. Replicate division creates partitions using random sampling without
replacement.

Guha et al. (2012) provide conditions such that a model constructed by averaging
coefficients from ordinary least squares models along replicate division data partitions
converges asymptotically to the single ordinary least squares model fitted to all of the
data, called the reference model. We constrain our attention to D&R under replicate
division to create a single, averaged model.

Note that the D&R method is related to boosting since in both cases an ensemble
of models is created by sampling from the original data set. There are differences,
however. First, whereas boosting samples with replacement, D&R samples without
replacement. Second, whereas the result of training a boosted model is an ensemble
of individual learners, the result of training a D&R model is a single model
whose parameters have been averaged together.

D&R describes a general computational approach to large-scale regression and
classification requiring only a single pass through the data set. This gives D&R a
computational advantage over those models which require multiple passes through the data.
Matloff (2014) extends both the statistical results, by providing a broader class of
estimators under which D&R converges, and the complexity analysis, by showing that
the computational speed-up can be greater than the number of parallel processes applied
to each division. Kleiner et al. (2011) also developed the statistical theory further by
identifying a general class of models that converge under the random partition
assumption, as well as showing that bootstrap models can be created on each partition and the
ensemble of models over all partitions converges asymptotically to a single bootstrap
model over the entire data set.

There are at least two practical issues to consider when implementing the D&R 
approach to model fitting with replicate division on distributed data. First, partitions 
should consist of random sets of rows to ensure that the aggregate model converges to 
its reference. This is important in real-world, finite-sized problems to avoid artifacts 
related to ordering in the data, for example from the data collection process. If samples
are arranged in partitions without randomization, then individual partitions may
capture information that is drastically different from the other partitions or the total 
population. As a result, an aggregate model can perform poorly when compared to its 
reference. 

The second challenge is in deciding the number of samples per block. The
convergence results show that the ensemble of models converges to the reference model
asymptotically in the number of random samples in the partitions. This implies that, to
minimize the difference between the ensemble and the reference, block sizes should be
as large as possible. However, the amount of speed-up achieved is generally directly
proportional to the number of partitions (assuming each block is allocated its own
process). Therefore, there is a trade-off between the statistical consistency of a model
created using the D&R approach (with respect to its reference) and the parallelism that
can be achieved when fitting the model.

The regression techniques presented work on subsets of disjoint rows and, with
this observation, a natural subsequent question is how well these individual subsets
represent the data set as a whole. This is a relevant question since if we have a small
representative subset then we may not need all of the data. We may be able to fit a
model more efficiently by only using the representative subset. If we find that some
subsets differ drastically from the rest of the data set we may need to investigate these
subsets further. They may indicate different underlying relationships that can be
conditioned upon or they may indicate data integrity issues. In either case they motivate
further investigation.

This paper explores the trade-off between statistical consistency and computational 
efficiency by introducing a statistic that estimates the concordance between a subset 
and the entire data set based on their respective variance-covariance structures. We 
provide reproducible experiments that illustrate our concept of concordance and its 
use along with benchmark results. A final section is devoted to discussion and future
directions. 


2 Motivating Case: Least Squares Regression 

Consider the following example linear model and corresponding least squares problem:

$$Y = X\beta + \varepsilon, \tag{1}$$

where $Y, \varepsilon \in \mathbb{R}^{n}$ and $\beta \in \mathbb{R}^{d}$, $n > d$; each element of $\varepsilon$ is an i.i.d. random variable
with mean zero; and $X$ is a matrix in $\mathbb{R}^{n \times d}$ with full column rank. The ordinary
least squares problem is posed as $\hat{\beta} = \operatorname{argmin}_{\beta} \|X\beta - Y\|$ and has the closed-form
analytic solution defined by the normal equations:

$$\hat{\beta} = (X^T X)^{-1} X^T Y. \tag{2}$$

We remark that computed solutions rarely use Equation (2) directly but rather use QR or
SVD decompositions of $X$ for numerical stability. Equation (2) remains important for
analysis purposes. Consider the row-wise partitioning of Equation (1):

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_r \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_r \end{pmatrix} \beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_r \end{pmatrix},$$

where $Y_1, Y_2, \ldots, Y_r$; $X_1, X_2, \ldots, X_r$; and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_r$ are data partitions such that each
block is composed of subsets of rows of a data set, blocks are disjoint, and the aggregate
of all blocks is the original data set. Without loss of generality we assume that $n/r = c$
is an integer so that the number of samples in each submatrix is the same. The blocks
may, for example, be located in files across a network of computers for distributed
computation.

When each of the $X_i$ represents a random partition of $X$, the D&R estimate of the least
squares regression coefficients averages the block-wise estimates of the
slope coefficients $\hat{\beta}_i$,

$$\hat{\beta}_i = (X_i^T X_i)^{-1} X_i^T Y_i, \qquad i = 1, \ldots, r. \tag{3}$$

Compare the D&R approach to block-wise computation of the (full) least squares
solution:

$$\hat{\beta} = \left( \sum_{i=1}^{r} X_i^T X_i \right)^{-1} \sum_{i=1}^{r} X_i^T Y_i. \tag{4}$$


Both approaches are similarly easy to compute in parallel. The overall computational
cost of both approaches is $O(d^3)$, although the D&R approach (Equation (3)) has a larger
constant term by a factor of $r$ compared to block-wise solution of the full problem
(Equation (4)).
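
The two computations compared above can be sketched in R as follows. This is a minimal illustration on simulated data, not the authors' benchmark code: the sample size, block count, coefficient values, and all variable names are assumptions made for the example.

```r
# Sketch: D&R averaging (Equation 3) versus block-wise normal equations (Equation 4).
# The simulated data, block count r, and coefficients are illustrative assumptions.
set.seed(1)
n <- 1200; d <- 3; r <- 4
X <- cbind(1, matrix(rnorm(n * (d - 1)), n))   # n x d design matrix with intercept
beta <- c(2, -1, 0.5)
Y <- X %*% beta + rnorm(n)

# Replicate division: assign rows to r equally sized blocks at random.
block <- sample(rep(seq_len(r), length.out = n))
rows <- split(seq_len(n), block)

# D&R: solve each block with a QR-based solver, then average the coefficients.
block_fits <- sapply(rows, function(idx) qr.solve(X[idx, , drop = FALSE], Y[idx]))
beta_dr <- rowMeans(block_fits)

# Block-wise full solution: sum the partial cross-products, then solve once.
XtX <- Reduce(`+`, lapply(rows, function(idx) crossprod(X[idx, , drop = FALSE])))
XtY <- Reduce(`+`, lapply(rows, function(idx) crossprod(X[idx, , drop = FALSE], Y[idx])))
beta_full <- drop(solve(XtX, XtY))

# Reference model fitted to all rows at once.
beta_ref <- drop(qr.solve(X, Y))

rbind(dr = beta_dr, blockwise = beta_full, reference = beta_ref)
```

The block-wise solution reproduces the reference coefficients up to rounding, while the D&R average is close to them but not identical.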

Recall that the D&R solution $\bar{\beta}$ is an estimate of the full least squares solution $\hat{\beta}$. It is reasonable
to ask why one would spend more computational effort to produce only an estimate
of a solution, when a similarly easy direct solution method is available. Despite the
apparent advantage of the direct block-wise solution method shown in Equation (4), the
D&R approach is potentially superior in two ways: lower network communication cost
in parallel computing settings and better numerical stability. We outline each advantage
below.

Lower network communication cost. Assume that the problem is distributed so that
each data block $i = 1, 2, \ldots, r$ is located on a different computer in a network. The
D&R approach shown in Equation (3) averages $r$ sets of $d$ model coefficients to produce
an averaged output model, for a total of $rd$ numbers transmitted between computers over
the network. Block-wise solution of the full problem outlined in Equation (4), by comparison,
sums $r$ partial $d \times d$ matrix products and $r$ vectors of length $d$, for a total of $rd^2 + rd$
numbers to transmit over the network. The communication cost for the full solution
is expensive when there are moderately large numbers of columns in the model matrix.
For example, with 8-byte double precision floating point numbers, $r = 100$ and $d =
1000$, the full problem solution must transmit 800 MB across the network. The D&R
approach only transmits 800 KB by comparison.
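
The arithmetic behind these two figures can be reproduced directly. The variable names below are chosen for this illustration only.

```r
# Numbers sent over the network for r blocks and d columns (values from the text).
r <- 100; d <- 1000; bytes_per_double <- 8
dr_bytes   <- r * d * bytes_per_double               # r coefficient vectors of length d
full_bytes <- (r * d^2 + r * d) * bytes_per_double   # r (d x d) matrices plus r vectors
c(dr_KB = dr_bytes / 1e3, full_MB = full_bytes / 1e6)
```

The $rd^2$ term dominates the full-solution cost, so the ratio between the two approaches grows roughly linearly in $d$.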

Better numerical stability. Although we routinely use the normal equations for
analysis of least squares problems in exact arithmetic, their use computationally is not
recommended because the matrix $X^T X$ is generally less well-conditioned (and never
better conditioned) than $X$. Instead, least squares problems are typically computed
using either a QR or singular value decomposition of the model matrix $X$. Unfortunately,
parallel computation of a QR or SVD factorization for block row-distributed matrices
is neither easy nor readily available to most high-level programming languages like R
and Python, especially in MapReduce-like computing settings.* Note also that model
matrices involving contrast variables derived from so-called factor variables are often
sparse. SVD and QR decompositions destroy sparsity patterns and the resulting
factorized model matrix may consume much more memory than the original. Distributed
computation of the full problem is typically performed using the normal equations as
shown in Equation (4) for these reasons. The D&R approach, by contrast, is free to
use numerically stable solution methods in each block. Moreover, since the blocks
are relatively small, loss of a sparse model matrix representation in each block due to
factorization is more tolerable.

* A notable exception is HPC systems using ScaLAPACK and MPI (see the R pbd package, for example),
but these are usually very specialized systems.

We illustrate our point about numerical stability with a simple example. Let

$$X = \begin{pmatrix} 10^{9} & -1 \\ -1 & 10^{-5} \end{pmatrix}, \quad \beta = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \text{and} \quad y = X\beta.$$


Note that $X$ is an ill-conditioned matrix, but not so ill-conditioned as to prevent
numerically stable techniques from working. This can be demonstrated using the R programming
environment (R Core Team, 2014) to compare least squares solutions of this example
computed using a stable technique and using the normal equations.

> X <- matrix(c(1e9, -1, -1, 1e-5), 2)
> X
       [,1]   [,2]
[1,]  1e+09 -1e+00
[2,] -1e+00  1e-05

> y <- X %*% c(1, 1)

One stable least-squares solution approach gives us the expected solution:

> qr.solve(X, y)
     [,1]
[1,]    1
[2,]    1

Now with the normal equations. Note that in R we form the inverse of $X^T X$ as
described in Equation (2).

> qr.solve(t(X) %*% X) %*% t(X) %*% y 

Error in qr.solve : singular matrix 'a' in solve 

An informative error. If we override the error, we get a bad result that is very different
from what we expected:

> qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
             [,1]
[1,]    0.9999995
[2,] 1080.4998042

The example, although pathological by design, shows that using the normal equations
directly to solve least squares problems can lead to failure or poor results. Even in
better-conditioned problems, use of the normal equations will result in a loss of
numerical accuracy in the solution. The D&R method is free to employ numerically stable
techniques in each block.
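
The loss of accuracy in better-conditioned problems comes from the fact that forming $X^T X$ squares the 2-norm condition number of $X$. The sketch below illustrates this on a simulated matrix; the matrix and its column scales are assumptions chosen for the example, not data from this paper.

```r
# Sketch: the condition number of X^T X is the square of that of X.
set.seed(2)
X <- matrix(rnorm(200 * 3), 200) %*% diag(c(1, 10, 1000))  # columns on very different scales
k_X   <- kappa(X, exact = TRUE)              # 2-norm condition number of X
k_XtX <- kappa(crossprod(X), exact = TRUE)   # 2-norm condition number of X^T X
c(k_X = k_X, k_XtX = k_XtX, ratio = k_XtX / k_X^2)
```

The reported ratio should be very close to one, so a solver working on $X^T X$ starts from a problem that is quadratically harder than one working on $X$ itself.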

The normal-equation approach does have the advantage of being easily updatable. With D&R, as
new data are received they can be added to a new block. If the block coefficients are stored
then updating the model is simply a matter of getting the estimate for the new block
and once again averaging over all of the coefficient estimates. The removal of data
or the updating of an existing block can be handled similarly. The D&R method is
only updatable when the incoming data are guaranteed to be distributed at random, that
is, there is no local correlation structure that exists only in a new block. The effect of
non-randomness is explored further in Section 6.


3 Common-basis Concordance


The previous section shows the trade-offs and characteristics of estimating the slope
coefficients in an ordinary linear regression using the normal equations along with D&R.
In both cases the slope coefficients are a function of the variance-covariance structure
among the regressand and regressor variables, and it seems reasonable that fitting
a subset of the data could give similar results and require less computation.
This section introduces concordance as a statistic for capturing how well a subset of
data represents the whole. For this paper, a data set is “representative” of a larger
data set if the correlation structure of the two data sets is similar, and we are most
interested in the case where one data set is a small subset of the rows of a larger data
set. In Section 7, other potential applications and directions are proposed.

Let $A$ and $B$ be $n \times d$ and $m \times d$ matrices, respectively,
with $m, n > d$; let $A^T A$ and $B^T B$ have eigenvalue decompositions
$V^T \Lambda_A V$ and $V^T \Lambda_B V$, $V^T V = I$, respectively; and assume $(A^T A)^{-1}$ and $(B^T B)^{-1}$
exist. The concordance of these two matrices is defined as

$$S(A, B) = \frac{1}{d} \, \|A B^{\dagger}\|_F^2, \tag{5}$$

where $B^{\dagger} = (B^T B)^{-1} B^T$ is the Moore-Penrose matrix pseudo-inverse of $B$, and
$\| \cdot \|_F$ is the Frobenius matrix norm. This measure essentially compares the
variance-covariance structure between two matrices. A concordance value of one results when
$A^T A = B^T B$. The concordance value is less than one when there is, on average,
less variance in $A$ than $B$, and the concordance value is greater than one in the reverse
case. Statistical characteristics of common-basis concordance will be derived in the
next section. The rest of this section derives some deterministic characteristics.


Proposition 1. If the common-basis concordance conditions specified above hold, then

$$\|A B^{\dagger}\|_F^2 = \sum_{i=1}^{d} \frac{\lambda_A(i)}{\lambda_B(i)},$$

where $\lambda_A(i)$ and $\lambda_B(i)$ are the $i$th eigenvalues of $A^T A$ and $B^T B$ respectively.

Proof.

$$\begin{aligned}
\|A B^{\dagger}\|_F^2 &= \operatorname{tr}\left( B (B^T B)^{-1} A^T A (B^T B)^{-1} B^T \right) && (6) \\
&= \operatorname{tr}\left( B^T B (B^T B)^{-1} A^T A (B^T B)^{-1} \right) && (7) \\
&= \operatorname{tr}\left( A^T A (B^T B)^{-1} \right) && (8) \\
&= \operatorname{tr}\left( V^T \Lambda_A V V^T \Lambda_B^{-1} V \right) \\
&= \operatorname{tr}\left( V V^T \Lambda_A \Lambda_B^{-1} \right) && (9) \\
&= \operatorname{tr}\left( \Lambda_A \Lambda_B^{-1} \right)
\end{aligned}$$

Steps (7) and (9) follow from the cyclic permutation property of the trace. □

Corollary 1. The matrix concordance remains unchanged if both $A$ and $B$ are
right-multiplied by any orthonormal matrix, in particular the eigenvector matrix $V$.

The corollary follows directly from $\|A V (B V)^{\dagger}\|_F^2 = \|A V V^T B^{\dagger}\|_F^2 = \|A B^{\dagger}\|_F^2$.

Proposition 1 and Corollary 1 require that the eigenvectors of the scatter matrices of $A$ and
$B$ are the same to show that the concordance of their variance-covariance structure
only depends on their eigenvalues.

The matrix concordance for matrices sharing a common scatter matrix eigenvector
basis is determined by the ratios of the eigenvalues of their scatter matrices. Applied to data-analytic
statistical challenges, concordance between two data matrices can be analyzed to see 
if both are samples from the same distribution. Furthermore concordance allows us to 
estimate how well a sample from a data set captures the variance-covariance structure 
of the entire data set. 
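
The direct and trace forms of the concordance, and its invariance under right-multiplication by an orthonormal matrix, can be checked numerically. The sketch below assumes the concordance is normalized by $1/d$ so that identical scatter matrices give a value of one; the function names and simulated matrices are illustrative and not part of the paper's supplemental code.

```r
# Direct form: (1/d) * ||A B^+||_F^2, with B^+ = (B^T B)^{-1} B^T.
concordance_direct <- function(A, B) {
  B_pinv <- solve(crossprod(B), t(B))
  sum((A %*% B_pinv)^2) / ncol(A)
}

# Trace form: (1/d) * tr(A^T A (B^T B)^{-1}).
concordance_trace <- function(A, B) {
  sum(diag(crossprod(A) %*% solve(crossprod(B)))) / ncol(A)
}

set.seed(3)
A <- matrix(rnorm(50 * 4), 50)
B <- matrix(rnorm(80 * 4), 80)
Q <- qr.Q(qr(matrix(rnorm(16), 4)))   # a random 4 x 4 orthonormal matrix

s_direct  <- concordance_direct(A, B)
s_trace   <- concordance_trace(A, B)
s_rotated <- concordance_direct(A %*% Q, B %*% Q)
c(direct = s_direct, trace = s_trace, rotated = s_rotated, self = concordance_direct(B, B))
```

The first three values agree up to rounding, and the concordance of a matrix with itself is one. Note that the equality of the direct and trace forms only needs $B^T B$ to be invertible; the common eigenvector basis is what reduces the trace further to a sum of eigenvalue ratios.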


4 Deriving the Ratio Distributions from the Concordance Statistic


The previous section proposed a measure of concordance between two matrices with 
the same variance-covariance structure based on the Frobenius norm of one matrix 
normalized by the pseudo-inverse of the other. Furthermore, it was shown that the 
proposed concordance measure is preserved when each of the two matrices is right- 
multiplied by the eigenvector matrix of the shared variance-covariance matrix. In this 
section, matrices will be assumed to be drawn at random from a specified distribution 
and the concordance’s distribution will be derived. 

In particular, it will be assumed that the data to be analyzed is drawn from some
distribution with zero mean and known variance-covariance matrix $\Sigma$. A data set with
$n$ samples will be denoted $X_{[n]}$, indicating that the data set is made up of the set of
samples from 1 to $n$. By introducing this absolute index to the samples we can easily
express the first $i$ samples in $X_{[n]}$ as $X_{[i]}$ for $i < n$. Likewise, we can express all of
the samples except the first $i$ as $X_{[n] \setminus [i]}$. When the concordance is calculated between
an entire data set and a subset it will be referred to as overlapping and when the data
sets are disjoint they will be referred to as non-overlapping.

The concordance $S(X_{[i]}, X_{[n]})$ normalizes the variance-covariance matrix of $X_{[i]}$
by that of $X_{[n]}$. By Corollary 1 this is equivalent to projecting the data onto the common
orthonormal column basis. The resulting orthonormalized variance-covariance matrix
has zero expected values for all off-diagonal elements. The diagonal elements are the
ratios of the common variance estimates, which reduce to a sum of random variables
centered at one with standardized dispersion.

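Proposition 2. Suppose that $X_{[n]}$ is sampled from an i.i.d. multivariate normal
distribution. Then for sufficiently large $d$

$$S\left(X_{[i]}, X_{[n]}\right) \sim \mathcal{N}\left(1, \frac{2(n-i)}{d \, i \, (n+2)}\right),$$

where $\mathcal{N}$ is a normal distribution with specified mean and variance parameters.
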

Proof. The normed matrices in Equation (5) reduce to $d$ sums of $n$ squared standard
normals. Furthermore, $i$ of the samples are repeated in $X_{[i]}$ and $X_{[n]}$. Then the
concordance can be expressed as

$$\begin{aligned}
S\left(X_{[i]}, X_{[n]}\right) &= \frac{1}{d} \sum_{j=1}^{d} \frac{\frac{1}{i} \sum_{k=1}^{i} M_{kj}^2}{\frac{1}{n} \sum_{k=1}^{n} M_{kj}^2} && (10) \\
&= \frac{1}{d} \sum_{j=1}^{d} \frac{\frac{1}{i} \sum_{k=1}^{i} \sigma_j Z_k^2}{\frac{1}{n} \sum_{k=1}^{n} \sigma_j Z_k^2} \\
&= \frac{1}{d} \sum_{j=1}^{d} \frac{n}{i} \, \frac{\sum_{k=1}^{i} Z_k^2}{\sum_{k=1}^{n} Z_k^2}, && (11)
\end{aligned}$$

where $M = X_{[n]} V$, $\sigma_j$ is the $j$th eigenvalue of $\Sigma$, and $Z_k$ is distributed as standard
normal. Each of the $d$ terms in the summation is a ratio of random variables. The
degrees of freedom in the numerator and denominator are equal to $i$ and $n$ respectively,
with $i$ of the sampled random variables appearing in both the numerator and
denominator. The ratio of distributions, where samples are repeated in the denominator,
is distributed as Beta. The result follows by applying the central limit theorem to the
sample mean of $d$ independent Beta distributions, each of which is multiplied by the
constant $n/i$. □
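
The location and scale of the scaled Beta terms in the proof above can be checked from the standard Beta moments. As a sketch, write $W_j \sim \mathrm{Beta}\left(\tfrac{i}{2}, \tfrac{n-i}{2}\right)$ for the ratio of sums in the $j$th term (the symbol $W_j$ and the explicit Beta parameters are introduced here for illustration; they are the standard result for a ratio of chi-square sums with shared terms):

```latex
\begin{aligned}
\mathbb{E}\left[\tfrac{n}{i} W_j\right]
  &= \frac{n}{i} \cdot \frac{i/2}{n/2} = 1, \\
\operatorname{Var}\left[\tfrac{n}{i} W_j\right]
  &= \frac{n^2}{i^2} \cdot
     \frac{\tfrac{i}{2} \cdot \tfrac{n-i}{2}}
          {\left(\tfrac{n}{2}\right)^2 \left(\tfrac{n}{2} + 1\right)}
   = \frac{2(n-i)}{i(n+2)}.
\end{aligned}
```

Averaging $d$ independent terms divides this variance by $d$, which is approximately $2/(id)$ when $n \gg i$.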

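Proposition 3. Assume the same conditions as in Proposition 2 along with the added
condition that $n \gg i \gg 2$. Then

$$S\left(X_{[i]}, X_{[n] \setminus [i]}\right) \sim \mathcal{N}\left(1, \frac{2}{id}\right).$$
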

Proof. By definition, the concordance is the same as in Equation (10) except that the
summation in the denominator goes from $i + 1$ to $n$. As a result we get a result similar
to Equation (11), where

$$S\left(X_{[i]}, X_{[n] \setminus [i]}\right) = \frac{1}{d} \sum_{j=1}^{d} \frac{\frac{1}{i} \sum_{k=1}^{i} Z_k^2}{\frac{1}{n-i} \sum_{k=i+1}^{n} Z_k^2}. \tag{12}$$

The ratio of independent $\chi^2$ distributions, where the numerator and denominator are
normalized by their respective degrees of freedom, is $F$ with $i$ and $n - i$ degrees of freedom. The result
follows by applying the central limit theorem to the sample mean of $d$ independent $F$
distributions. □
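
The moments of a single term in the overlapping and non-overlapping sums can be checked by simulation. The sketch below draws squared standard normals and compares the empirical mean and variance of each ratio with the standard moments of a scaled Beta and of an $F$ distribution; the values of `n`, `i`, and the number of replicates are assumptions chosen to keep the example small.

```r
# Sketch: simulate one term of the overlapping (scaled Beta) and
# non-overlapping (F) concordance sums and compare with theory.
set.seed(4)
n <- 200; i <- 20; reps <- 20000
Z2  <- matrix(rnorm(reps * n)^2, reps)      # each row: n squared standard normals
num <- rowSums(Z2[, 1:i]) / i               # mean of the first i squares

overlap    <- num / (rowSums(Z2) / n)                       # (n/i) * Beta(i/2, (n-i)/2)
nonoverlap <- num / (rowSums(Z2[, (i + 1):n]) / (n - i))    # F(i, n-i)

beta_var <- 2 * (n - i) / (i * (n + 2))
f_mean   <- (n - i) / (n - i - 2)
f_var    <- 2 * (n - i)^2 * (n - 2) / (i * (n - i - 2)^2 * (n - i - 4))

rbind(
  overlapping     = c(mean = mean(overlap),    var = var(overlap),
                      theory_mean = 1,      theory_var = beta_var),
  non_overlapping = c(mean = mean(nonoverlap), var = var(nonoverlap),
                      theory_mean = f_mean, theory_var = f_var)
)
```

Both ratios are centered near one, and the non-overlapping version is slightly more dispersed because its denominator uses fewer samples.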


Proposition 4. Suppose that $X_{[n]}$ is sampled from some distribution with constant
mean and variance-covariance. Let $M$ be $X_{[n]} V$ as before. If $1 \le j \le d$, $Z$ is
standard normal, and the following joint convergence holds



1 


n 


1 
Then the concordance is approximately distributed as Cauchy with location parameter
1 and scale parameter $\sqrt{(n - i)/i}$:

$$S\left(X_{[i]}, X_{[n] \setminus [i]}\right) \sim \operatorname{Cauchy}\left(1, \sqrt{\frac{n-i}{i}}\right). \tag{13}$$


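Proof.

$$\begin{aligned}
S\left(X_{[i]}, X_{[n] \setminus [i]}\right) &= \frac{1}{d} \sum_{j=1}^{d} \frac{\frac{1}{i} \sum_{k=1}^{i} M_{kj}^2}{\frac{1}{n-i} \sum_{k=i+1}^{n} M_{kj}^2} \\
&\approx \frac{1}{d} \sum_{j=1}^{d} \frac{\sigma_j Z_1 \sqrt{n-i}}{\sigma_j Z_2 \sqrt{i}} \\
&= \frac{1}{d} \sum_{j=1}^{d} \frac{\sqrt{n-i} \, Z_1}{\sqrt{i} \, Z_2} && (14)
\end{aligned}$$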

Each term in the summation is a ratio of two normal random variables. Equation (14)
follows from the joint convergence assumption and the application of the
continuous mapping theorem. The ratio of two normally distributed random variables,
with common location parameter, is distributed as Cauchy. The result follows by
realizing that the sample mean of independent Cauchy distributions with common location
and scale parameters is Cauchy with the same location and scale parameters. □


The scale and location values based on the distributional results are shown in Table
1. It may be noted that, while the Cauchy approximation relies on fewer distributional
assumptions, it is also less applicable. In particular its support includes the negative
reals. This is a result of taking the ratio of central limit theorem approximations of
two random variables, as shown in Equation (14). However, this derivation may suggest
that, when the normal distribution assumptions do not hold, the resulting concordance
distribution is heavy-tailed.


Model                  Location                 Scale                                                       Approx. Concordance Distr.
---------------------  -----------------------  ----------------------------------------------------------  --------------------------
$\frac{n}{i}$ Beta     $1$                      $\frac{2(n-i)}{i(n+2)}$                                     $\mathcal{N}(1, \frac{2}{id})$
$F(i, n-i)$            $\frac{n-i}{n-i-2}$      $\frac{2(n-i)^2 (n-2)}{i (n-i-2)^2 (n-i-4)}$                $\mathcal{N}(1, \frac{2}{id})$
Cauchy                 $1$                      $\sqrt{\frac{n-i}{i}}$                                      Cauchy

Table 1: The derived model distributions with their location and scale parameters. Note
that for the $F$ and Beta models the approximate concordance assumes $n \gg i \gg 2$.


5 Benchmark Description, Design, and Implementation


To assess the behavior of the relative stability measure proposed in Equation (5), this
section makes use of the “Airline on-time performance” data set (RITA, 2009), which
was released for the 2009 American Statistical Association (ASA) Section on Statistical
Computing and Statistical Graphics biennial data exposition. The data set includes
commercial flight arrival and departure information from October 1987 to April 2008
for those carriers with at least 1% of domestic U.S. flights in a given year. In total, there
is information for over 120 million flights, with 29 variables related to flight time,
delay time, departure airport, arrival airport, and so on. The uncompressed data
set is 12 gigabytes (GB) in size. Benchmarks in this section focus on the model matrix
representation of the variables shown below in Table 2. The model matrix under
consideration will use the treatment-contrast expansion of the categorical variables and
has a total of 43 columns ($d = 43$).
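
To make the treatment-contrast expansion concrete, the sketch below builds a model matrix from a tiny made-up data frame. The values and the number of factor levels are illustrative assumptions and do not come from the airline data; only the column names follow Table 2.

```r
# Toy data; the level counts here are illustrative, not those of the airline data.
flights <- data.frame(
  Month     = factor(c(1, 2, 3, 1, 2, 3)),
  DayOfWeek = factor(c(1, 2, 1, 2, 1, 2)),
  DepTime   = c(480, 615, 930, 1200, 45, 1310),
  DepDelay  = c(0, 12, -3, 45, 7, 0)
)

# Treatment contrasts (R's default for unordered factors): a factor with k levels
# contributes k - 1 indicator columns, with the first level as the baseline.
M <- model.matrix(~ Month + DayOfWeek + DepTime + DepDelay, data = flights)
colnames(M)
ncol(M)
```

Here the intercept, two `Month` indicators, one `DayOfWeek` indicator, and the two numeric variables give six columns in total.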


Variable Name  Description                                             Type         Number of categories (if applicable)
-------------  ------------------------------------------------------  -----------  ------------------------------------
Year           The year of flights                                     categorical  22 (1987 to 2008)
Month          The month of flights                                    categorical  12
DayOfWeek      The day of week of flights                              categorical  7
DepTime        The departure time of flights (minutes after midnight)  numeric      NA
DepDelay       The departure delay of flights (minutes)                numeric      NA

Table 2: Variables that will be considered for the covariance matrix stability benchmark.


The 12 GB Airline On-time data set will likely not be considered “big” to many
readers. Papers such as Kane et al. (2013) have shown how the data set can be explored
and analyzed on relatively modest hardware. However, in designing the benchmarks
two principles were considered before sheer data size. First, the data set is publicly 
available. The code included in the Supplemental Material of this paper is capable of 
downloading the data set and running the benchmarks. Users are encouraged to engage 
the data themselves and perform their own analyses. Second, the data set is large 
enough to investigate the scatter matrix concordance properties along with the scaling 
behavior of the various regression techniques described in this paper. Together, the data 
set and the code available with this paper provide a set of accessible and reproducible 
benchmarks that form a basis for instruction and subsequent research. 

The benchmarks presented in the next section were written in the R programming
environment (R Core Team, 2014) and can be run on a single machine sequentially,
a single machine in parallel, or on a cluster of machines using the foreach
(Revolution Analytics and Weston, 2014) package, which provides a concurrent interface
to a number of different parallel computing technologies for embarrassingly parallel
challenges. The implementation also makes use of the iterators and itertools
(Revolution Analytics, 2014; Weston and Wickham, 2014) packages, thereby decoupling
data access from retrieval and management. The most accessible and straightforward
implementation approaches are used to illustrate the methodological principles
of the models and their implementation. The benchmark implementation is flexible
enough to be deployed to any number of different data management, communication
protocol, memory, and processing configurations. It is also easily modifiable to
accommodate alternative technologies and methodologies.


6 Benchmark Results 

This section provides benchmarks exploring the convergence of the concordance to
one on increasingly larger subsets of the Airline On-time data set alongside the
convergence of the slope coefficients of a Generalized Linear Model (GLM) fitted to the same
subsets. These two sets of benchmarks establish an empirical connection between
regression and concordance. However, it should also be noted that since the concordance
approach does not distinguish between independent and dependent variables it
provides a diagnostic not only for this regression but for any regression involving the same
variables. Furthermore, since the concordance calculation for a subset of the variables
can be found directly from the corresponding scatter matrices, a diagnostic can easily be
calculated for any regression involving any subset of the variables.

6.1 Random Sampling Similarity Convergence 

The first set of benchmarks takes random samples of varying sizes from 10 to 5000
from the Airline On-time data set and calculates the overlapping and non-overlapping
concordance between the random subset and the entire data set. The overlapping and
non-overlapping values were equal up to seven decimal places and so only the
overlapping concordance is reported.

Figure 1 shows the convergence of the concordance to the value of one when
concordance is calculated both directly using Equation (5) and using the trace-equivalent
version in Equation (6). The plot shows that both versions capture the variance-covariance
structure after only several thousand samples (out of a total of approximately 120 million).
Furthermore, it can be seen that the trace-equivalent version is slightly smaller than
the direct calculation. This is likely because the direct calculation is more sensitive to
noise in the correlation terms of the scatter matrix and this may account for the
overshoot when the sample size is zero and the small overshoot when the sample size is
1,000.

Each concordance-distribution derivation reduces to the average of $d$ similarly
distributed random variables. The terms summed in the concordance measure can
therefore be thought of as samples, and their distribution can be examined. This distribution
information is shown in Figure 2 with varying sample size. The plot again shows
little difference between the direct calculation and its trace-equivalent, especially after
a few thousand random samples are taken and the variance-covariance structure
becomes known. The trace equivalent does appear to be slightly positively skewed when
the number of samples is smaller and then appears to be essentially symmetric, like the
direct calculation of the concordance. The figure also indicates that the concordance
is not heavy-tailed. The 95th percentile converges relatively quickly in the number of
samples.

Figure 1: Convergence of concordance to one in the number of samples using random
sampling.

Figure 2: The distribution of overlapping and non-overlapping concordance values in
the number of samples using random sampling.

Figure 3: Convergence of concordance to one in the number of samples using
non-random sampling.

6.2 Non-Random Sampling Similarity Convergence 

The second set of benchmarks take contiguous samples, starting at the beginning of 
the Airline On-time data set, varying the size from 10 to 1,000,000 to empirically 
determine the effect of not using randomized subsets when calculating concordance. 
Once again, the overlapping and non-overlapping concordance values were essentially 
identical and only the overlapping concordance is reported. 

Figure 3 shows the convergence of the concordance values, once again using
Equations (5) and (6). Where both versions had nearly converged to one after only a few
thousand samples, non-random sampling requires millions, with the trace-equivalent
concordance lagging the direct version.

Convergence in concordance is slow in the non-random case and this may be due
to the fact that the model matrix corresponding to the factor expansion does not
include a sample of all of the information from the Year variable, based on Figure
3. The poor convergence characteristics extend to the distribution of the concordance,
as shown in Figure 4, thereby underscoring the need for random sampling in order to achieve
representative subsets.

Figure 4: The distribution of overlapping and non-overlapping similarities in the
number of samples using non-random sampling.
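
The effect of ordering can be reproduced on a small simulated example. The sketch below is not the airline benchmark: the data are synthetic, rows are deliberately stored in sorted order to mimic an ordering artifact, and the `concordance` helper scales each scatter matrix by its row count, which is an assumption of this illustration so that subsets of different sizes are comparable.

```r
# Concordance with each scatter matrix scaled by its row count
# (an assumption of this sketch, so a representative subset gives a value near one).
concordance <- function(A, B) {
  SA <- crossprod(A) / nrow(A)
  SB <- crossprod(B) / nrow(B)
  sum(diag(SA %*% solve(SB))) / ncol(A)
}

set.seed(5)
n <- 5000; i <- 500
x <- sort(rnorm(n))               # rows stored in sorted order: an ordering artifact
X <- cbind(1, x, rnorm(n))        # intercept, ordered column, unordered column

s_contiguous <- concordance(X[1:i, ], X)            # first i rows, as stored
s_random     <- concordance(X[sample(n, i), ], X)   # i rows chosen at random
c(contiguous = s_contiguous, random = s_random)
```

The random subset gives a value close to one, while the contiguous subset only sees the lower tail of the ordered column and gives a value well above one.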


6.3 Concordance and GLM Convergence 

The third set of benchmarks compares the concordance with the log mean square error
between the slope coefficient estimates fitted using a random subset of the data and
estimates fitted using the entire data set. The variable relationship under investigation
is the following logistic regression:

Late ~ Year + Month + DayOfWeek + DepartureTime + DepartureDelay

where a flight is “Late” if its arrival is at least 30 minutes after its scheduled arrival.
The other variables are described in Table 2. For each subset size, the procedure was
repeated 10 times and the average concordance and log MSE of the coefficients were
recorded.

Figures 5 and 6 show the convergence behavior of the concordance to one and log
MSE of the slope coefficients to zero, respectively. When the number of samples is
small with respect to the number of columns in the design matrix the concordance is
low, indicating that the variance-covariance structure is not well-represented, and the log
MSE of the slope coefficients is high, indicating that the estimates of the coefficients are
poor. These values quickly increase and decrease, respectively, until approximately 500
samples, after which the rate of change of both the concordance and the log MSE slows.

The slow convergence after 500 samples of both measurements implies that for a
precise estimate of the slope coefficients, and a corresponding concordance value close
to one, a larger subset will need to be fitted. When the subset is 1,200,000 samples
(about 1% of the total data size) the concordance value is 0.9820691 and the log
MSE is -2.788843. At 12,000,000 samples (10% of the data) the concordance value
is 0.9822198 and the log MSE is -4.902919. This “slow” convergence in the number
of samples may be indicative of complex relationships among the variables that require
larger samples to capture.

Figure 5: Convergence of concordance to one in the number of samples using random
sampling.

Figure 6: Convergence of log MSE between slope coefficient estimates using random
subsets and the entire Airline On-time data set.
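
The log MSE procedure described in this subsection can be sketched on simulated data. The example below is a stand-in for the airline benchmark rather than a reproduction of it: the data-generating model, variable names, and subset sizes are assumptions, while the 10 repetitions per subset size follow the description above.

```r
# Sketch: log MSE between logistic regression coefficients fitted on random
# subsets and on the full (simulated) data set.
set.seed(6)
n <- 20000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$late <- rbinom(n, 1, plogis(-1 + 0.8 * dat$x1 - 0.5 * dat$x2))

# Reference coefficients from the full data set.
full <- coef(glm(late ~ x1 + x2, family = binomial, data = dat))

log_mse <- function(size) {
  sub <- dat[sample(n, size), ]
  est <- coef(glm(late ~ x1 + x2, family = binomial, data = sub))
  log(mean((est - full)^2))
}

# Average the log MSE over 10 random subsets for each subset size.
sizes <- c(200, 2000, 10000)
result <- sapply(sizes, function(s) mean(replicate(10, log_mse(s))))
names(result) <- sizes
result
```

The averaged log MSE decreases as the subset grows, mirroring the qualitative behavior reported for the airline data.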

7 Conclusions 

Regression models depend directly on the model design matrix and its properties. This 
paper attempts to bridge the “small-data” and asymptotic behavior of regression models 
by proposing a simple measure of concordance between a design matrix and a subset of 
its rows that estimates how well a subset captures the variance-covariance structure of 
increasingly large data sets. We illustrate the use of this measure in a heuristic method 
for selecting row partition sizes that balance statistical and computational efficiency 
goals in real-world problems. 

Our future work in this area will focus on data fusion. In many cases it may be
desirable to combine two data sources with analogous measurements to increase the power
of statistical experiments. Concordance provides a distance between
variance-covariance structures with known distributional characteristics. Tests for equivalence
can then be used to rigorously assess the appropriateness of combining data sources,
thereby allowing practitioners to make better use of existing, potentially under-powered
data.